Team Member: SWETHA KOLLOJU
Advisors: Dr. Sarah Bratt, Dr. Cristian Román-Palacios
This project investigates how scientific software use—specifically in evolutionary biology—clusters across geographic, linguistic, and academic genealogical lines. Scientific software is not merely a technical tool; it acts as a gatekeeper of community practice and knowledge exchange. By tracing software mentions across 197,000+ scholarly articles (1990–2024), we ask:
The study focuses on phylogenetic software used to build evolutionary trees, such as BEAST, RAxML, MrBayes, TNT, and many others. Despite serving the same scientific purpose, these tools have been adopted unevenly. This project aims to uncover how software choice becomes a social pattern, potentially shaping careers and affecting the innovation pipeline of evolutionary science.
We built a computational pipeline that processes large-scale full-text articles and links them with structured author metadata. The key phases of the methodology are captured in the visual below:
1. Define Research Questions
Frame the problem: How do geographic, linguistic, and academic lineages shape scientific software usage?
2. Collect Data
Downloaded 197,677 full-text articles (1990–2024) from Constellate (JSTOR + Portico) using the search string "phylogenetic OR systematics OR Cladistic analysis", limited to article documents.
3. Extract Software Mentions
Applied a dictionary- and regex-based approach (CZ software method) to identify phylogenetic software names in the full text, extracting the sentence context for validation.
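A dictionary-plus-regex pass of this kind can be sketched as follows. This is a minimal illustration, not the study's actual extractor: the term list, the sentence splitter, and the function name `extract_mentions` are assumptions made for the example.

```python
import re

# Illustrative dictionary only; the study's full term list is not given in the text.
SOFTWARE_TERMS = ["BEAST", "RAxML", "MrBayes", "TNT", "IQ-TREE"]

# Case-sensitive pattern with boundary checks, so "BEAST" does not match "beast".
# Longer names are tried first so overlapping names resolve to the longest one.
_pattern = re.compile(
    r"(?<![\w-])("
    + "|".join(re.escape(t) for t in sorted(SOFTWARE_TERMS, key=len, reverse=True))
    + r")(?![\w-])"
)

# Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
_sentence_split = re.compile(r"(?<=[.!?])\s+")


def extract_mentions(full_text):
    """Return (software, sentence) pairs for every dictionary hit in the text."""
    mentions = []
    for sentence in _sentence_split.split(full_text):
        for match in _pattern.finditer(sentence):
            mentions.append((match.group(1), sentence.strip()))
    return mentions


sample = (
    "Trees were inferred with RAxML and MrBayes. "
    "The beast of a dataset took weeks. "
    "Parsimony searches used TNT."
)
mentions = extract_mentions(sample)
```

Keeping the full sentence next to each hit is what makes manual validation possible: a reviewer can check whether "TNT" or "PAST" in a given sentence really refers to the software.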
4. Merge with OpenAlex
Cross-referenced each DOI with the OpenAlex API to retrieve author affiliations, countries, and institutional metadata, which OpenAlex aggregates from multiple sources (Microsoft Academic, Crossref, ORCID, etc.).
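The DOI lookup might look roughly like the sketch below. It assumes the public OpenAlex works endpoint and the `authorships`, `author`, and `institutions` fields of a work record; the helper names, the row layout, and the sample record (with placeholder identifiers) are hypothetical and are not taken from the project's code.

```python
import json
import urllib.parse
import urllib.request

OPENALEX_WORKS = "https://api.openalex.org/works/"


def fetch_work(doi):
    """Fetch one OpenAlex work record by DOI (needs network access)."""
    url = OPENALEX_WORKS + "doi:" + urllib.parse.quote(doi, safe="/")
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def author_rows(work, software_terms):
    """Flatten a work record into (doi, author_id, country_code, software) rows."""
    rows = []
    for authorship in work.get("authorships", []):
        author_id = (authorship.get("author") or {}).get("id")
        countries = {
            inst.get("country_code")
            for inst in authorship.get("institutions", [])
            if inst.get("country_code")
        }
        # Authors without a known country still get a row, with country None.
        for country in sorted(countries) or [None]:
            for term in software_terms:
                rows.append((work.get("doi"), author_id, country, term))
    return rows


# Made-up record with placeholder identifiers, used only to show the shape.
sample_work = {
    "doi": "https://doi.org/10.0000/example",
    "authorships": [
        {"author": {"id": "https://openalex.org/A0000000001"},
         "institutions": [{"country_code": "US"}]},
        {"author": {"id": "https://openalex.org/A0000000002"},
         "institutions": []},
    ],
}
rows = author_rows(sample_work, ["RAxML"])
```

In practice `fetch_work` would be called once per DOI, with rate limiting and error handling that this sketch leaves out.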
5. Analyze Patterns
Constructed networks (author–software, language–software, country–software), performed fixed-effects regressions on innovation and career outcomes, and conducted chi-square and ANOVA tests on software usage over decades.
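To make the chi-square step concrete, here is a small sketch of the Pearson statistic for a software-by-decade contingency table. The counts are invented for illustration and are not results from the study; the function name `chi_square` is likewise an assumption.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom.

    `table` is a list of rows, e.g. one row per decade and one column per tool.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if decade and software choice were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Toy counts (NOT study data): rows are two decades, columns are two tools.
counts = [
    [30, 10],
    [10, 30],
]
stat, dof = chi_square(counts)
```

A large statistic relative to the degrees of freedom suggests that tool usage differs between decades. A library routine such as `scipy.stats.chi2_contingency` would additionally return a p-value.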
We collected data from the Constellate website, identified articles containing specific software mentions, and filtered the dataset. Using the DOIs, we extracted author details via OpenAlex, including author IDs and country affiliations. We then prepared an edge list combining:
This consolidated dataset formed the basis for all subsequent network analyses.
This interactive network graph presents the collaborative landscape between authors and the phylogenetic software they use, extracted from ~5,000 scientific papers. It offers a visual summary of:
Data Source: author_id_edgelist_pretty_with_term.csv
A pre-processed dataset listing, for each paper, a pair of authors (author1, author2) and the software mentioned (Term); a single DOI often carries several software mentions.
Data Filtering: Only papers with documented software mentions are retained, up to 5,000 DOIs.
Node and Edge Logic:
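The filtering step and one plausible edge rule can be sketched as below. The column names `author1`, `author2`, and `Term` come from the dataset description above; the `doi` column name, the rule that each author is linked to each software term on the paper, and the function name are assumptions made for this example.

```python
import csv
import io
from collections import Counter

MAX_DOIS = 5000  # cap stated in the filtering description


def author_software_edges(rows, max_dois=MAX_DOIS):
    """Count (author, software) edges from the first `max_dois` DOIs with a mention."""
    seen_dois = set()
    edges = Counter()
    for row in rows:
        term = (row.get("Term") or "").strip()
        if not term:
            continue  # keep only rows with a documented software mention
        doi = row.get("doi")
        if doi not in seen_dois:
            if len(seen_dois) >= max_dois:
                continue  # DOI cap reached; skip papers not already kept
            seen_dois.add(doi)
        for author in (row.get("author1"), row.get("author2")):
            if author:
                edges[(author, term)] += 1
    return edges


# Tiny in-memory stand-in for the CSV file; values are placeholders.
sample_csv = io.StringIO(
    "doi,author1,author2,Term\n"
    "d1,A1,A2,RAxML\n"
    "d1,A1,A2,MrBayes\n"
    "d2,A3,A4,\n"
    "d3,A1,A5,RAxML\n"
)
edges = author_software_edges(csv.DictReader(sample_csv), max_dois=2)
```

The resulting counter maps each author–software pair to the number of rows supporting it, which can then be passed to a graph library as weighted edges.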
This bipartite network visualization shows the relationship between publication languages and phylogenetic software tools used in scientific literature.
This dual-panel bar chart visualizes how frequently various phylogenetic software tools are used across different publication languages (relative frequency %).
This subplot isolates rarely used tools (≤1% frequency) to reveal niche software such as:
Plotting these tools separately keeps them visible; on a shared axis they would be overshadowed by the dominant platforms.
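The split between the two panels can be illustrated with a short sketch. It assumes that relative frequency is computed within each language (each tool's share of that language's mentions) and applies the ≤1% threshold named above; the function names and the toy counts are hypothetical.

```python
from collections import Counter

RARE_THRESHOLD = 1.0  # percent; the cut-off used for the rare-tools panel


def relative_frequencies(mentions):
    """mentions: iterable of (language, software) -> {language: {software: percent}}."""
    by_language = {}
    for language, software in mentions:
        by_language.setdefault(language, Counter())[software] += 1
    result = {}
    for language, counts in by_language.items():
        total = sum(counts.values())
        result[language] = {s: 100.0 * n / total for s, n in counts.items()}
    return result


def split_common_rare(percentages, threshold=RARE_THRESHOLD):
    """Separate tools above the threshold from those at or below it."""
    common = {s: p for s, p in percentages.items() if p > threshold}
    rare = {s: p for s, p in percentages.items() if p <= threshold}
    return common, rare


# Toy mentions (NOT study data): 200 English-language mentions.
toy = [("English", "MEGA")] * 150 + [("English", "BEAST")] * 49 + [("English", "DELTA")]
freqs = relative_frequencies(toy)
common, rare = split_common_rare(freqs["English"])
```

With these toy counts, MEGA and BEAST land in the main panel and DELTA, at 0.5%, lands in the rare-tools panel.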
This graph visualizes the relationship between phylogenetic software tools and their publication-year usage patterns, styled as a chord-like circular network. It shows how different tools are distributed over time, emphasizing both historical trends and recent popularity.
This interactive visualization reveals how phylogenetic software tools are used across different countries, based on mention frequencies in scientific publications. It shows global distribution, regional preferences, and dominant software footprints.
Edges connect countries to software based on mention counts. Thicker edges = higher mention frequency.
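One simple way to turn mention counts into edge thickness is linear scaling against the most frequent pair, sketched below. The width range, the function name, and the toy pairs are assumptions for illustration; the visualization's actual scaling is not specified in the text.

```python
from collections import Counter


def edge_widths(pairs, min_width=1.0, max_width=8.0):
    """Map each (country, software) pair to a line width that grows with its count."""
    counts = Counter(pairs)
    if not counts:
        return {}
    top = max(counts.values())
    return {
        edge: min_width + (max_width - min_width) * n / top
        for edge, n in counts.items()
    }


# Toy mentions (NOT study data): one (country, software) pair per mention.
toy_pairs = [("US", "MEGA")] * 8 + [("BR", "TNT")] * 4
widths = edge_widths(toy_pairs)
```

Here the most-mentioned pair gets the maximum width and a pair with half as many mentions gets a width halfway between the two bounds.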
Our study shows that the software scientists use to build evolutionary trees is not chosen on features or performance alone; it is also shaped by where they live, what language they speak, and who they work with. By analyzing nearly 200,000 scientific articles, we found clear patterns: certain tools like MEGA and NDE are used almost everywhere, while others like TNT or PAUP are more popular in specific countries or communities.
We also saw that English dominates scientific publishing, but tools like PAST, DELTA, and BEAST also appear in publications in other languages, though less often. Some software is used only in particular regions, suggesting that language and geography still influence access and adoption.
Over time, we’ve moved from older parsimony-based tools to newer Bayesian and likelihood-based ones like MrBayes, RAxML, and IQ-TREE. These changes reflect shifts in the field but also in how scientists train and work together.
Looking at how authors and software are connected, it’s clear that people often cluster around specific tools—probably influenced by mentors, departments, or even hiring trends. This kind of clustering can help build strong support networks, but it might also make it harder for new ideas or people to break in.
In short, scientific software is more than just code. It reflects the structure of the scientific community: how ideas spread, how people collaborate, and how careers evolve. To truly support innovation and inclusion, we need to pay attention not just to what tools are used—but who’s using them, where, and why.